AFRL-SR-BL-TR-98- 


REPORT DOCUMENTATION PAGE




1. AGENCY USE ONLY (Leave blank)

2.  REPORT  DATE 

11/24/97 

3. REPORT TYPE AND DATES COVERED

Final  Technical  Report; 9/30/93-9/29/97 

4.  TITLE  AND  SUBTITLE 

Building  Information  Servers 

SIMS-AA  Technical  Report 

5. FUNDING NUMBERS

C=F49620-93-1-0594 


6. AUTHOR(S)

Craig A. Knoblock, William Swartout, and Sheila Tejada

7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES)

USC INFORMATION SCIENCES INSTITUTE

4676  ADMIRALTY  WAY 

MARINA  DEL  REY,  CA  90292-6695 

8. PERFORMING ORGANIZATION

REPORT NUMBER

9. SPONSORING/MONITORING AGENCY NAME(S) AND ADDRESS(ES)

Defense  Advanced  Research  Projects  Agency 

3701 No. Fairfax Drive

Arlington, VA 22203

10. SPONSORING/MONITORING

AGENCY  REPORT  NUMBER 


11. SUPPLEMENTARY NOTES


12A. DISTRIBUTION/AVAILABILITY STATEMENT  12B. DISTRIBUTION CODE

UNCLASSIFIED/UNLIMITED

13. ABSTRACT (Maximum 200 words)


This research addressed the problem of determining the relationships among multiple, diverse information
sources in order to support the integration of data from these sources. In general, integrating data from
multiple sources requires a model of the precise relationships between the sources. Constructing such a
model by hand is a difficult and time-consuming process. The relationships captured in a model describe the
type of overlap between data instances in different sources. In this work data mining techniques were used
to determine these relationships by comparing the data instances between sources. A related problem is that
data instances can exist in different formats across several sources, e.g. IBM may be abbreviated as IBM in
one source and appear as International Business Machines in another source. This work addressed this
problem by developing techniques for automatically determining the mapping between names used in
different sources. These integration techniques were used in conjunction with the SIMS information
mediator, allowing SIMS to correctly and efficiently integrate data across several sources that contained data
instances appearing in multiple formats.
AASERT 97 Final Report

This AASERT research was driven by the need to automate the integration of information
from multiple diverse information sources available on the Internet in order to answer
user queries efficiently. For example, suppose a user wants to find the annual
incomes of all computer companies whose current stock price is greater than $100.
Answering this query involves accessing and integrating information from the Securities and
Exchange Commission (SEC) web server, which has company annual reports, and a current stock
quote server available on the web. To perform this task manually, a user would first retrieve
the annual incomes of all of the computer companies from the SEC web site, then
retrieve the current stock price of each of these companies from the stock quote server, and
check whether the price is greater than $100. This task would require significant
effort from the user, especially when retrieving large amounts of data.
The user would also need a great deal of knowledge about how to access the
data contained in the sources, as well as how to integrate the two sets of data. A more
desirable approach is to provide the user with a single interface that allows access to
multiple information sources, abstracting away the need for the user to know the location or
query access methods of any particular source.

The SIMS information broker was designed to provide the user with such an
interface. As the example shows, in order to properly access and integrate
the data from independent heterogeneous sources, the system must have knowledge about
the relationships among the data contained in each of these sources, as well as the
relationships between the sources. SIMS captures this relationship information in the form
of a model, called a domain model. The relationships captured in the model describe
the extent to which data instances overlap between the sources. Presently, the
domain model is manually generated by human experts who are familiar with the data
stored in the sources. To generate domain models automatically, data mining techniques are
used to determine which data instances appear in multiple sources, e.g. which (if any)
companies from the SEC web site, such as the computer company IBM, also have stock quote
information in the stock quote server. Once the overlapping data instances are determined,
the relationship between the data in the sources can be modeled in the domain model as
subset, superset, equality, overlapping, or association.
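
The classification step can be illustrated with a short sketch. It assumes each source has already been reduced to a set of comparable instance names; the function name and the "disjoint" label are hypothetical and are not taken from SIMS, and the association relationship is not covered because the report does not define it in terms of instance overlap.

```python
def classify_relationship(instances_a, instances_b):
    """Classify how the instances of source A relate to those of source B.

    Hypothetical helper for illustration; not the SIMS implementation.
    """
    a, b = set(instances_a), set(instances_b)
    if a == b:
        return "equality"
    if a < b:  # every instance of A is in B, and B has more
        return "subset"
    if a > b:  # every instance of B is in A, and A has more
        return "superset"
    if a & b:  # some, but not all, instances are shared
        return "overlapping"
    return "disjoint"  # assumed label for "no shared instances"


# Invented example data: company names found in each source.
sec_companies = {"IBM", "Apple", "Sun"}
quoted_companies = {"IBM", "Apple"}
relationship = classify_relationship(sec_companies, quoted_companies)
```

Here `relationship` describes the SEC source relative to the stock quote source, so swapping the arguments turns a superset result into a subset result.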

Data mining techniques determine these relationships by comparing the data
instances between sources to discover which instances are shared. Potentially,
all instances from one source could be compared with all of the instances from
another source. To constrain the number of comparisons, each source is first
mined for the properties or features of its data, such as its type, length, and range; then
only instances that have compatible properties are compared. The experimental
results showed that this technique reduced the number of comparisons performed when
mining the sources for the model relationships, and that in some cases the number of
comparisons was reduced by 70 percent.
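
A minimal sketch of this pruning idea follows. It assumes each source is a mapping from attribute name to a non-empty list of values, and that "compatible" means sharing a value type and having intersecting length ranges; these definitions and all names are assumptions made for illustration, since the report does not specify the compatibility test.

```python
def profile(values):
    """Mine simple features from a non-empty list of values: types and length range."""
    lengths = [len(str(v)) for v in values]
    return {
        "types": {type(v).__name__ for v in values},
        "min_len": min(lengths),
        "max_len": max(lengths),
    }


def compatible(p, q):
    """Assumed test: a shared value type and intersecting length ranges."""
    shares_type = bool(p["types"] & q["types"])
    lengths_intersect = p["min_len"] <= q["max_len"] and q["min_len"] <= p["max_len"]
    return shares_type and lengths_intersect


def candidate_pairs(source_a, source_b):
    """Return only the attribute pairs whose instances are worth comparing."""
    profiles_a = {name: profile(vals) for name, vals in source_a.items()}
    profiles_b = {name: profile(vals) for name, vals in source_b.items()}
    return [
        (name_a, name_b)
        for name_a in source_a
        for name_b in source_b
        if compatible(profiles_a[name_a], profiles_b[name_b])
    ]


# Invented example data; the values are placeholders, not real figures.
sec = {"company": ["IBM", "Apple"], "income": [6000, 1000]}
quotes = {"name": ["IBM", "Sun"], "price": [104.5, 48.25]}
pairs = candidate_pairs(sec, quotes)
```

With this toy data, only the two name attributes survive the filter, so instance-level comparison runs on one attribute pair instead of four.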

A special case for information integration arises when data instances exist in
different formats across several sources, e.g. the company IBM can appear as International
Business Machines in another source. In this case, constructing a model that represents
only the amount of overlap between sources is not sufficient to properly integrate the
retrieved data. Information relating each specific pair of corresponding data instances must
also be captured, e.g. (IBM, International Business Machines). This information is stored
in a mapping table, which is modeled as an information source with overlapping
relationships between the sources for which it contains mapping information. In other
words, the mapping table source has subset relationships with the SEC and stock server
sources. This integration technique has allowed SIMS to properly and efficiently integrate
data across several sources that contained data instances appearing in multiple formats.
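
The role of the mapping table can be sketched as a join that passes through the table. This is an illustration under stated assumptions, not the SIMS query planner: the function name, the dictionary layout of the sources, and every numeric value are invented for the example.

```python
def incomes_above(price_floor, sec_income, stock_price, mapping_table):
    """Join two sources through a mapping table of (quote name, SEC name) pairs.

    Returns the SEC income of each company whose quoted price exceeds price_floor.
    """
    results = {}
    for quote_name, sec_name in mapping_table:
        # A pair is usable only if both sources actually contain the instance.
        if quote_name not in stock_price or sec_name not in sec_income:
            continue
        if stock_price[quote_name] > price_floor:
            results[sec_name] = sec_income[sec_name]
    return results


# Placeholder data: names follow the report's example, numbers are invented.
sec_income = {"International Business Machines": 6.0, "Example Computer Corp": 1.5}
stock_price = {"IBM": 104.5, "EXCC": 17.0}
mapping_table = [
    ("IBM", "International Business Machines"),
    ("EXCC", "Example Computer Corp"),
]
answer = incomes_above(100, sec_income, stock_price, mapping_table)
```

Because every pair in the table must exist in both sources to be useful, the table's instances form a subset of each source, which matches the subset relationships described above.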